scGate: marker-based purification of cell types from single-cell RNA-seq datasets

This demo illustrates the main functionalities of the scGate package for purifying cell type populations of interest from heterogeneous single-cell cell datasets. We start from basic, single-gene filters and move gradually to more complex hierarchical models composed of multi-gene signatures. We show these examples on public data, but you should be able to adapt them to your own single-cell datasets.

Setting up the environment

library(renv)
renv::activate()
renv::restore()

library(ggplot2)
library(dplyr)
library(patchwork)
library(viridis)
remotes::install_github("carmonalab/scGate",ref = "v1.0.0")
library(scGate)

Loading demo datasets

This helper function retrieves some toy datasets with cell type annotations, useful to test our models:

testing.datasets <- scGate::get_testing_data(version = "hsa.latest")

Let’s play with a (downsampled) dataset of PBMCs from Hao et al. 2021 (https://doi.org/10.1016/j.cell.2021.04.048)

obj <- testing.datasets[["Satija"]]
DimPlot(obj, label = T, repel = T, group.by = "celltype.l1") + theme(legend.position = "none",
    aspect.ratio = 1)

# ggsave('plots/umap.Hao.celltype.pdf',width = 4, height = 4)

Creating a simple gating model

Now let’s setup a simple scGate gating model to purify a population of interest - here B cells.

B cells are usually identified by the expression of CD20, encoded by MS4A1.

my_scGate_model <- gating_model(name = "Bcell", signature = c("MS4A1"))
my_scGate_model
  levels   use_as  name signature
1 level1 positive Bcell     MS4A1

Run scGate to purify B cells with the simple single-gene model

obj <- scGate(data = obj, model = my_scGate_model, assay = DefaultAssay(obj))
DimPlot(obj, cols = c(list(Impure = "gray", Pure = "green"))) + theme(aspect.ratio = 1)

# ggsave('plots/umap.Hao.Bcell.pdf',width = 4, height = 4)

Because in this demo example we know the ground truth cell types (at least, as defined by the authors), we can evaluate scGate predictive performance in terms of precision/positive predictive value, recall/sensitivity, and accuracy using Matthews Correlation Coefficient (MCC)

scGate::performance.metrics(grepl("^B ", obj$cell_type), obj$is.pure == "Pure")
     PREC       REC       MCC 
0.9967742 0.9967742 0.9961825 

With this very simple model, >99% (Recall) are isolated with a >99% purity (Precision), and overall classification performance (MCC) >99%

Another example: we can isolate plasmacytoid dendritic cells (pDCs), defined using the marker LILRA4 (e.g. https://pubmed.ncbi.nlm.nih.gov/30395816/)

my_scGate_model <- gating_model(name = "pDC", signature = c("LILRA4"))  # add one positive signature
my_scGate_model
  levels   use_as name signature
1 level1 positive  pDC    LILRA4
# run the model
obj <- scGate(data = obj, model = my_scGate_model, assay = DefaultAssay(obj))
DimPlot(obj, cols = c(list(Impure = "gray", Pure = "green"))) + theme(aspect.ratio = 1)

scGate::performance.metrics(obj$cell_type == "pDC", obj$is.pure == "Pure")
PREC  REC  MCC 
   1    1    1 

Gating models with positive and negative markers

Natural killer cells (NKs) are characterized by the marker KLRD1. However, KLRD1 can also be expressed by some T cell subsets. To improve sensitivity to isolate NKs, we can include in our gating strategy “negative” T cell markers such as CD3D.

my_scGate_model <- gating_model(name = "NK", signature = c("NCAM1+", "KLRD1+", "CD3D-"))

obj <- scGate(data = obj, model = my_scGate_model, assay = DefaultAssay(obj))
DimPlot(obj, cols = c(list(Impure = "gray", Pure = "green"))) + theme(aspect.ratio = 1)

# ggsave('plots/umap.Hao.NK.pdf',width = 4, height = 4)

With this simple model, 95% of NKs (recall) are isolated with a 100% purity (Precision), and overall classification performance (MCC) of 97%

scGate::performance.metrics(grepl("^NK", obj$cell_type), obj$is.pure == "Pure")
     PREC       REC       MCC 
0.9581395 0.9716981 0.9607030 

Hierarchical gating models

In the examples above, we have been using a single-cell dataset derived from blood (PBMC). When working with more complex tissues, such as tumors, we might need to apply multiple levels of gating to isolate the population of interest. scGate allows to purify by steps, using a hierarchical gating model - for example, first purifying immune cells, and among the immune cells the cell type of interest.

Let’s explore (a downsampled version of) the whole tumor dataset by Jerby-Arnon et al. 2018 (https://doi.org/10.1016/j.cell.2018.09.006), with the cell type annotations provided by the authors.

obj <- testing.datasets[["JerbyArnon"]]
DimPlot(obj, label = T, repel = T, group.by = "cell_type") + theme(legend.position = "none",
    aspect.ratio = 1)

# ggsave('plots/umap.jerby.celltype.pdf',width = 4, height = 4)

The dataset comprises non-immune populations such as malignant/cancer cells (Mal), Endothelial cells (Endo) and cancer-associated fibroblasts (CAF). We can try to purify first all immune cells, using pan-immune cell marker CD45 (encoded by the gene PTPRC):

my_scGate_model <- gating_model(name = "immune", signature = c("PTPRC"))
obj <- scGate(data = obj, model = my_scGate_model, assay = DefaultAssay(obj))
DimPlot(obj, cols = c(list(Impure = "gray", Pure = "green"))) + theme(aspect.ratio = 1)

# ggsave('plots/umap.jerby.immune.pdf',width = 4, height = 4)
scGate::performance.metrics(obj$cell_type %in% c("B.cell", "NK", "T.CD4", "T.CD8",
    "T.cell", "Macrophage"), obj$is.pure == "Pure")
     PREC       REC       MCC 
0.8999310 0.9775112 0.8014946 

From the immmune cells, we can generate a simple gating model to purify macrophages (using common markers CD68 and FCGR1A).

Instead of purifying macrophages directly from the whole tissue, we can set up a hierarchial scGate model to: i) isolate immune cells as in the previous example, ii) isolate macrophages from immune cells.

Hierarchical gating models can be specified in scGate using parameter “level”, as follows:

my_scGate_model <- gating_model(name = "immune", signature = c("PTPRC"), level = 1)  # initialize model with one positive signature
my_scGate_model <- gating_model(model = my_scGate_model, name = "macrophage", signature = c("CD68",
    "FCGR1A"), level = 2)  # add positive signature at second step
obj <- scGate(data = obj, model = my_scGate_model, assay = DefaultAssay(obj))
DimPlot(obj, cols = c(list(Impure = "gray", Pure = "green"))) + theme(aspect.ratio = 1)

# ggsave('plots/umap.jerby.macro.pdf',width = 4, height = 4)
scGate::performance.metrics(obj$cell_type %in% c("Macrophage"), obj$is.pure == "Pure")
     PREC       REC       MCC 
0.9068826 0.9218107 0.9023599 

Here we isolated 92% of macrophages with a 91% purity (according to annotation by Jerby-Arnon et al.). We could easily improve PREC and REC by adding more positive and negative markers, respectively (e.g. removing lymphocytes)

We can always inspect the distribution of UCell scores calculated for each signature. For example:

FeaturePlot(obj, features = c("macrophage_UCell")) + scale_color_viridis(option = "D")

# ggsave('plots/umap.jerby.macro.UCell.pdf',width = 4, height = 4)

We can see how the two-level gating model worked in each step using the function plot_levels. The first plot shows the purification step for CD45+ cells, the second plot the isolation of macrophages.

wrap_plots(plot_levels(obj))

Editing models

scGate models of arbitrary complexity can be easily constructed to achieve high purification performance. These gating models can be written using the gating_model() function, and manually edited using the fix() function.

You can also export your basic model as tabulated text and edit it manually in Excel.

fix(my_scGate_model)  # edit locally in R
write.table(my_scGate_model, "test.tsv", sep = "\t")  # export and then edit on Excel
my_scGate_model_edited <- scGate::load_scGate_model("test.tsv")  # tabulated-text model can loaded

Fast evaluation of model performance

We provide a built-in function test_my_model to automatically evaluate the performance of a gating model using 3 pre-annotated testing datasets:

my_scGate_model <- gating_model(name = "Bcell", signature = c("MS4A1"))
panBcell.performance <- test_my_model(my_scGate_model, target = "Bcell")

panBcell.performance$performance
                PREC       REC       MCC
JerbyArnon 0.8550725 0.9711934 0.8984003
Zilionis   0.9572193 0.9521277 0.9499812
Satija     0.9967742 0.9967742 0.9961825

In one of the datasets, precision was not optimal (85%).

We can refine it, for instance, by removing potentially contaminating T cell genes (CD3D) and try again:

my_scGate_model <- gating_model(model = my_scGate_model, name = "Tcell", signature = c("CD3D"),
    negative = T)
panBcell.performance <- test_my_model(my_scGate_model, target = "Bcell")

panBcell.performance$performance
                PREC       REC       MCC
JerbyArnon 0.9446809 0.9135802 0.9193858
Zilionis   0.9572193 0.9521277 0.9499812
Satija     0.9967638 0.9935484 0.9942680

Predictive performance is now very high for the three sample datasets.

Using pre-defined gating models

We also provide some pre-defined gating models, these can be retrieved with the get_scGateDB() function:

models.DB <- scGate::get_scGateDB()

names(models.DB$human$generic)
 [1] "Bcell"           "CD4T"            "CD8T"            "MoMacDC"        
 [5] "Myeloid"         "NK"              "PanBcell"        "Plasma_cell"    
 [9] "Tcell"           "Tcell.alphabeta"

names(models.DB$mouse$generic)
[1] "CD4T"            "CD8T"            "MoMacDC"         "Myeloid"        
[5] "Tcell"           "Tcell.alphabeta"

Visualizing gating models

To visualize complex models in tree-like structures, we provide the plot_tree() function.

The following is a model to purify plasma (B) cells with high accuracy:

my_scGate_model <- models.DB$human$generic$Plasma_cell
plt.tree <- scGate::plot_tree(my_scGate_model)
plt.tree

At each level in the tree, UCell scores are evaluated both for positive and negative signatures. Only cells with sufficiently high UCell scores for at least one positive signature, and consistently low UCell scores for all negative signatures will be passed on to the next level in the tree. Upon reaching the final branch of the tree, only cells that passed all gating levels are labeled as “Pure” for the population of interest.

We can run the Plasma cell model on the whole-tumor dataset by Zilionis et al. (https://doi.org/10.1016/j.immuni.2019.03.009)

obj <- testing.datasets[["Zilionis"]]
DimPlot(obj, label = T, repel = T, group.by = "cell_type") + theme(legend.position = "none",
    aspect.ratio = 1)

Run scGate

obj <- scGate(obj, model = my_scGate_model)
DimPlot(obj, cols = c(list(Impure = "gray", Pure = "green"))) + theme(aspect.ratio = 1)

Visualize filtering results by level. We can see how the purity increases at each level.

plots <- scGate::plot_levels(obj)
wrap_plots(plots, ncol = 2)

DimPlot(obj, group.by = "cell_type", label = T, repel = T, label.size = 3) + ggtitle("Author's manual annot") +
    NoLegend()

Evaluate performance of filtering compared to original annotation

scGate::performance.metrics(actual = obj$Plasma_cell, pred = obj$is.pure == "Pure")
     PREC       REC       MCC 
0.9255319 0.9456522 0.9323972 

Further reading

The scGate package and installation instructions are available at: scGate package

The code for this demo can be found on GitHub

References

  • Hao, Y., Hao, S., Andersen-Nissen, E., Mauck III, W. M., Zheng, S., Butler, A., … & Satija, R. (2021). Integrated analysis of multimodal single-cell data. Cell.

  • Murray, L., Xi, Y., & Upham, J. W. (2019). CLEC4C gene expression can be used to quantify circulating plasmacytoid dendritic cells. Journal of immunological methods, 464, 126-130.

  • Jerby-Arnon, L., Shah, P., Cuoco, M. S., Rodman, C., Su, M. J., Melms, J. C., … & Regev, A. (2018). A cancer cell program promotes T cell exclusion and resistance to checkpoint blockade. Cell, 175(4), 984-997.

  • Zilionis, R., Engblom, C., Pfirschke, C., Savova, V., Zemmour, D., Saatcioglu, H. D., … & Klein, A. M. (2019). Single-cell transcriptomics of human and mouse lung cancers reveals conserved myeloid populations across individuals and species. Immunity, 50(5), 1317-1334.